Road accidents has been a critical problem as every year more than 1.2M people die across the globle. There is a pressing need to make use of the data and understand the underlying cause of problem. Road safety issues are complex. There are significant differences in policies within and across the countries. In this analysis, the data from Metropolitan Police Department's (MPD) crash data management system (COBALT) is studied to find relationship between fatality and independent features. The crash data is for DC state.
Each year, more than 1.2 million people die across the globe due to road crashes; there is a pressing need to understand the underlying cause of the problem. As road safety issues are complex; it involves multi-sectorial ranging from the public, stakeholders to the policy makers. Significant differences exist both across and within countries and therefore policies and interventions need to be adapted to the local environment. The effectiveness of interventions requires a multi-disciplinary approach which include enforcement, engineering and psychological and education approaches. While the resources are limited, road safety interventions must not only address the sustainability of the outcomes but also the cost-effectiveness to implement and maintain it. More important, interventions must be evidence-based and can be evaluated over time before it is translated into policy. Hence, the research cannot be done in silo for better addressing the complexity of road safety issues. For sustainability, road safety interventions need to be guided and governed by policy in the implementation and development.
data.describe()
| id | crime id | person id | age | |
|---|---|---|---|---|
| count | 5.963810e+05 | 5.963810e+05 | 5.963810e+05 | 426744.000000 |
| mean | 4.384924e+08 | 2.672116e+07 | 8.506922e+07 | 38.668302 |
| std | 1.721813e+05 | 1.238390e+06 | 8.613766e+06 | 20.897059 |
| min | 4.370014e+08 | 2.341134e+07 | 1.045383e+07 | -7990.000000 |
| 25% | 4.383433e+08 | 2.532167e+07 | 8.474899e+07 | 27.000000 |
| 50% | 4.384924e+08 | 2.680585e+07 | 8.497752e+07 | 37.000000 |
| 75% | 4.386415e+08 | 2.769386e+07 | 8.712287e+07 | 51.000000 |
| max | 4.387906e+08 | 2.872803e+07 | 9.077153e+07 | 237.000000 |
Age variables has lot of outliers. From the box plot values below 0 and above 85 are considered as outliers and replaced by na
data['age'] = np.where(data['age']<1, np.nan, data['age'])
fig = px.histogram(data, x="age", marginal="box")
fig.show()
data['age'] = np.where(data['age']>85, np.nan, data['age'])
Only age column shows 30% of missing data which is replaced by 0
In this analysis, fatal is the target variable. As indicated below the dataset is heavily imbalanced which is treated futhur. Only 0.06% (~417) are fatal
print("Percentage of non fatal class {} and fatal class {} ".format(
round((data.groupby(['fatal']).agg({'fatal':'count'})['fatal'][0]/ data.shape[0])*100, 2),
round((data.groupby(['fatal']).agg({'fatal':'count'})['fatal'][1]/ data.shape[0])*100, 2)))
print("Percentage of non major class {} and major class {} ".format(
round((data.groupby(['major injury']).agg({'major injury':'count'})['major injury'][0]/ data.shape[0])*100, 2),
round((data.groupby(['major injury']).agg({'major injury':'count'})['major injury'][1]/ data.shape[0])*100, 2)))
print("Percentage of non minor class {} and minor class {} ".format(
round((data.groupby(['minor injury']).agg({'minor injury':'count'})['minor injury'][0]/ data.shape[0])*100, 2),
round((data.groupby(['minor injury']).agg({'minor injury':'count'})['minor injury'][1]/ data.shape[0])*100, 2)))
Percentage of non fatal class 99.93 and fatal class 0.07 Percentage of non major class 96.42 and major class 3.58 Percentage of non minor class 88.93 and minor class 11.07
crime = pd.DataFrame(data[['crime id', 'severity']].value_counts())
crime = crime.reset_index()
crime.columns = ['crime id', 'severity', 'count of people']
fig = px.histogram(crime, x="count of people", color='severity', marginal="box")
fig.show()
vhCount = pd.DataFrame(data[['vehicle type', 'severity']].value_counts())
vhCount.reset_index(inplace=True)
vhCount.columns = ['vehicle type', 'severity','count']
fig = px.bar(vhCount, x='vehicle type', y='count', color='severity',
title="Number of accidents by vehicle type",
template="simple_white")
# fig.layout.update(showlegend=False)
fig.show()
vhCount = pd.DataFrame(data[['License state']].value_counts())
vhCount
vhCount.reset_index(inplace=True)
vhCount.columns = ['License state', 'count']
fig = go.Figure(data=go.Choropleth(
locations=vhCount['License state'], # Spatial coordinates
z = vhCount['count'], # Data to be color-coded
locationmode = 'USA-states', # set of locations match entries in `locations`
colorscale = 'Reds',
colorbar_title = "Accidents",
))
fig.update_layout(
title_text = 'Accidents by State',
geo_scope='usa', # limite map scope to USA
)
fig.show()
As people are not familiar with the terrain, we observe that people holding licenses from different states have higher fatality rate
stateFatality = data[data['fatality rate']>0.0000]
stateFatalityAvg = stateFatality.groupby(['License state'], as_index=False).agg({'fatality rate':'mean'})
stateFatalityAvg
fig = go.Figure(data=go.Choropleth(
locations=stateFatalityAvg['License state'], # Spatial coordinates
z = stateFatalityAvg['fatality rate'], # Data to be color-coded
locationmode = 'USA-states', # set of locations match entries in `locations`
colorscale = 'Reds',
colorbar_title = "Accidents",
))
fig.update_layout(
title_text = 'Average fatality rate by State',
geo_scope='usa', # limite map scope to USA
)
fig.show()
# fatality rate and age
fig = px.scatter(data, x="age", y="fatality rate", color='severity')
fig.show()
ptCount = pd.DataFrame(data[['person type', 'severity']].value_counts())
ptCount.reset_index(inplace=True)
ptCount.columns = ['person type', 'severity','count']
# fatal = ptCount[ptCount['fatal'] == 'Y']
# nonfatal = ptCount[ptCount['fatal'] == 'N']
# # x = ptCount['person type']
# # y = ptCount['count']
fig = px.bar(ptCount, x='person type', y='count', color='severity',
title="Number of people met with an accident by person type",
template="simple_white",
category_orders={ # replaces default order by column name
"person type": ["Driver", "Passenger", "Pedestrian", "Bicyclist"]
},)
percentage = pd.crosstab(data['person type'], data['severity']).apply(lambda r: (r/r.sum())*100, axis=1)
fig.add_annotation(text="Accidents % <br> Fatal: {}, Major: {}, <br> Minor: {}".format(
(round(percentage['fatal']['Driver'], 2)),
(round(percentage['major']['Driver'], 2)),
(round(percentage['minor']['Driver'], 2))
), x="Driver", y=70000, showarrow=False)
fig.add_annotation(text="Accidents % <br> Fatal: {}, Major: {},<br> Minor: {}".format(
(round(percentage['fatal']['Passenger'], 2)),
(round(percentage['major']['Passenger'], 2)),
(round(percentage['minor']['Passenger'], 2))
), x="Passenger", y=70000, showarrow=False)
fig.add_annotation(text="Accidents % <br> Fatal: {}, Major: {},<br> Minor: {}".format(
(round(percentage['fatal']['Pedestrian'], 2)),
(round(percentage['major']['Pedestrian'], 2)),
(round(percentage['minor']['Pedestrian'], 2))
), x="Pedestrian", y=70000, showarrow=False)
fig.add_annotation(text="Accidents % <br> Fatal: {}, Major: {}, <br> Minor: {}".format(
(round(percentage['fatal']['Bicyclist'], 2)),
(round(percentage['major']['Bicyclist'], 2)),
(round(percentage['minor']['Bicyclist'], 2))
), x="Bicyclist", y=70000, showarrow=False)
# fig.layout.update(showlegend=False)
fig.show()
driver = data[data['person type'] == 'Driver']
driver = driver[driver['age'] >1]
fig = px.histogram(driver, x="age", color='severity', marginal="box")
fig.show()
vhCount = pd.DataFrame(driver[['vehicle type', 'severity']].value_counts())
vhCount.reset_index(inplace=True)
vhCount.columns = ['vehicle type', 'severity','count']
fig = px.bar(vhCount, x='vehicle type', y='count', color='severity',
title="Number of accidents by vehicle type",
template="simple_white")
# fig.layout.update(showlegend=False)
fig.show()
passenger = data[data['person type'] == 'Passenger']
passenger = passenger[passenger['age'] >1]
fig = px.histogram(passenger, x="age", color='severity', marginal="box")
fig.show()
vhCount = pd.DataFrame(passenger[['vehicle type', 'severity']].value_counts())
vhCount.reset_index(inplace=True)
vhCount.columns = ['vehicle type', 'severity','count']
fig = px.bar(vhCount, x='vehicle type', y='count', color='severity',
title="Number of accidents by vehicle type",
template="simple_white")
# fig.layout.update(showlegend=False)
fig.show()
pedestrian = data[data['person type'] == 'Pedestrian']
pedestrian = pedestrian[pedestrian['age'] >1]
fig = px.histogram(pedestrian, x="age", color='severity', marginal="box")
fig.show()
dython.nominal.associations(a[['severity', 'fatal', 'fatality rate','vehicle type', 'ticket issued','Region', 'impaired','age', 'speeding', 'person type']], theil_u=True)
{'corr': severity fatal fatality rate vehicle type ticket issued \
severity 1.000000 1.000000 0.796636 0.075096 0.066784
fatal 1.000000 1.000000 0.796635 0.398397 -0.004870
fatality rate 0.796636 0.796635 1.000000 0.363773 0.068440
vehicle type 0.122996 0.398397 0.363773 1.000000 0.225724
ticket issued 0.066784 -0.004870 0.068440 0.225724 1.000000
Region 0.036209 0.233933 0.176967 0.199513 0.106797
impaired 0.084922 0.084922 0.069374 0.126110 0.020913
age 0.149978 0.112146 0.184820 0.154240 0.121119
speeding 0.208334 0.207661 0.262471 0.366904 0.170605
person type 0.087195 0.304221 0.265939 0.267597 0.240737
Region impaired age speeding person type
severity 0.048642 0.084922 0.149978 0.208334 0.090719
fatal 0.233933 0.084922 0.112146 0.207661 0.304221
fatality rate 0.176967 0.069374 0.184820 0.262471 0.265939
vehicle type 0.438978 0.126110 0.154240 0.366904 0.455998
ticket issued 0.106797 0.020913 0.121119 0.170605 0.240737
Region 1.000000 0.084674 0.125845 0.086361 0.293907
impaired 0.084674 1.000000 0.008789 -0.021951 0.131849
age 0.125845 0.008789 1.000000 0.040347 0.482570
speeding 0.086361 -0.021951 0.040347 1.000000 0.196817
person type 0.379490 0.131849 0.482570 0.196817 1.000000 ,
'ax': <matplotlib.axes._subplots.AxesSubplot at 0x1ef425cab08>}
X = a[['vehicle type', 'age', 'Region', 'speeding', 'person type']]
y = a[['fatality rate']]
X = pd.get_dummies(data = X)
X['age'] = X['age'].fillna(0)
import statsmodels.api as sm
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model
# Note the difference in argument order
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)
predictions = model.predict(X)
# Print out the statistics
model.summary()
| Dep. Variable: | fatality rate | R-squared: | 0.213 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.189 |
| Method: | Least Squares | F-statistic: | 9.103 |
| Date: | Sun, 04 Apr 2021 | Prob (F-statistic): | 6.27e-29 |
| Time: | 19:57:53 | Log-Likelihood: | -3873.3 |
| No. Observations: | 834 | AIC: | 7797. |
| Df Residuals: | 809 | BIC: | 7915. |
| Df Model: | 24 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | 15.8479 | 4.229 | 3.747 | 0.000 | 7.547 | 24.149 |
| age | 0.2195 | 0.043 | 5.069 | 0.000 | 0.134 | 0.304 |
| speeding | 25.1665 | 3.985 | 6.316 | 0.000 | 17.345 | 32.988 |
| vehicle type_Atv (all Terrain Vehicle) | 30.7411 | 24.190 | 1.271 | 0.204 | -16.741 | 78.223 |
| vehicle type_Bus | -19.8331 | 8.981 | -2.208 | 0.028 | -37.462 | -2.204 |
| vehicle type_Cargo Van | 5.4682 | 17.243 | 0.317 | 0.751 | -28.378 | 39.314 |
| vehicle type_Drugs/ Narcotics | 8.4546 | 8.991 | 0.940 | 0.347 | -9.193 | 26.102 |
| vehicle type_Firearms | -8.7411 | 5.244 | -1.667 | 0.096 | -19.035 | 1.553 |
| vehicle type_Large/heavy Truck | -6.9540 | 9.640 | -0.721 | 0.471 | -25.876 | 11.968 |
| vehicle type_Moped/scooter | 13.5547 | 10.419 | 1.301 | 0.194 | -6.896 | 34.006 |
| vehicle type_Motor Cycle | 23.6071 | 5.360 | 4.404 | 0.000 | 13.085 | 34.129 |
| vehicle type_Motorhome/camper/rv (recreational Vehicle) | -18.7633 | 17.241 | -1.088 | 0.277 | -52.605 | 15.079 |
| vehicle type_None | 7.0710 | 11.085 | 0.638 | 0.524 | -14.687 | 28.829 |
| vehicle type_Other Small/light Truck | 3.1306 | 17.321 | 0.181 | 0.857 | -30.868 | 37.129 |
| vehicle type_Other Vehicle | -3.7756 | 3.952 | -0.955 | 0.340 | -11.533 | 3.982 |
| vehicle type_Passenger Car/automobile | -2.4450 | 3.201 | -0.764 | 0.445 | -8.729 | 3.839 |
| vehicle type_Passenger Van | 0.1005 | 5.688 | 0.018 | 0.986 | -11.065 | 11.266 |
| vehicle type_Pickup Truck | -9.4119 | 7.273 | -1.294 | 0.196 | -23.687 | 4.863 |
| vehicle type_Suv (sport Utility Vehicle) | -6.3560 | 4.726 | -1.345 | 0.179 | -15.632 | 2.920 |
| Region_Midwest | -6.0421 | 8.863 | -0.682 | 0.496 | -23.440 | 11.356 |
| Region_Northeast | -9.2636 | 6.807 | -1.361 | 0.174 | -22.625 | 4.098 |
| Region_South | -0.7646 | 3.498 | -0.219 | 0.827 | -7.631 | 6.101 |
| Region_West | -11.6443 | 15.190 | -0.767 | 0.444 | -41.462 | 18.173 |
| person type_Bicyclist | 8.2567 | 6.855 | 1.205 | 0.229 | -5.199 | 21.712 |
| person type_Driver | -0.8723 | 4.527 | -0.193 | 0.847 | -9.758 | 8.013 |
| person type_Passenger | -1.0911 | 4.863 | -0.224 | 0.823 | -10.637 | 8.455 |
| person type_Pedestrian | 9.5546 | 8.172 | 1.169 | 0.243 | -6.487 | 25.596 |
| Omnibus: | 198.884 | Durbin-Watson: | 1.006 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 406.420 |
| Skew: | 1.344 | Prob(JB): | 5.59e-89 |
| Kurtosis: | 5.115 | Cond. No. | 5.06e+16 |
getRF(a)
F1 Score: 0.614